It is important in any data science project to define the objective as specifically as possible. Below, we state it from general to specific; this will direct our analysis.
Section 1: Initial Steps
Section 2: Data Cleaning and Preparation, Feature Engineering
Section 3: EDA of all variables and binning
Section 4: Random Forest Models
Random Forest is a type of bagging model. The term "bagging" comes from Bootstrap Aggregating: many models are built independently and their predictions are averaged. The best-known example is the random forest, which builds many decision trees and averages their outcomes. Each tree is fit on its own bootstrap sample of the data, and all of this sampling and modeling is done independently and simultaneously, as shown in Figure (II). The final outcome is the average of the predicted values of all the models.
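The bagging procedure described above can be sketched in a few lines (an illustration on synthetic data with scikit-learn, not the assignment's model): draw bootstrap samples, fit one tree per sample, and average the predictions.

```python
# Minimal bagging sketch: many trees, each fit on its own bootstrap sample,
# with the final prediction being the average across trees.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
rng = np.random.default_rng(0)
preds = []
for _ in range(50):                        # 50 independent models
    idx = rng.integers(0, len(X), len(X))  # bootstrap: sample rows with replacement
    tree = DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx])
    preds.append(tree.predict(X))
bagged = np.mean(preds, axis=0)            # aggregate: average all tree outputs
print(bagged.shape)  # (200,)
```

Note that a true random forest additionally samples a subset of features at each split; this sketch shows only the bootstrap-aggregating part.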
H2O is a fully open-source, distributed, in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical and machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning and more. H2O also has industry-leading AutoML functionality that automatically runs through algorithms and their hyperparameters to produce a leaderboard of the best models. The H2O platform is used by over 18,000 organizations globally and is extremely popular in both the R and Python communities.
# Import all packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import scipy
import time
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings("ignore")
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score, accuracy_score, confusion_matrix
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
import plotly
import plotly.express as px
from imblearn.datasets import make_imbalance
import pylab as pl
from collections import Counter
# Read the data
df = pd.read_csv('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 2 EDA/XYZloan_default_selected_vars.csv')
df.head(2)
print("Number of rows and columns in the dataset:")
df.shape
# Check basic statistics
print("Basic statistics of the columns are as follows:")
df.describe()
AP006
df['AP006'].hist()
df['AP006'].value_counts()
loan_default
# Check the target variable column
print("The number of 0's and 1's are:")
print(df['loan_default'].value_counts())
df['loan_default'].hist()
#df.info()
The columns 'Unnamed: 0', 'Unnamed: 0.1' and 'id' are row identifiers. They need to be dropped.
AP005 is a date-time column, which cannot be used directly for predictions in the model. Because almost every record has a unique timestamp, the raw field behaves like an ID column, as if you used the 'id' field in your decision trees, which misrepresents the variable while making predictions. I will derive year, month, day, weekday, etc. from this field. In some models, we may use 'year' as a variable just to explain any special volatility in the past, but we will never use the raw date-time field as a predictor.
The following columns are dropped: TD025, TD026, TD027, TD028, CR012.
Also dropped: TD029, TD044, TD048, TD051, TD054, TD055, TD061, TD062.
AP002: Gender
AP003: Education Code
AP004: Loan Term
AP006: OS Type
AP007: Application City Level
AP008: Flag if City not Application City
AP009: Binary format
MB007: Mobile Brands/type
Convert AP005 to the relevant formats of Year, Month, Day
df['AP005'] = pd.to_datetime(df['AP005'])
# Create 4 new columns
df['Loan_app_day_name'] = df['AP005'].dt.day_name()
df['Loan_app_month'] = df['AP005'].dt.month_name()
df['Loan_app_time'] = df['AP005'].dt.time
df['Loan_app_day'] = df['AP005'].dt.day
# Drop old column
df = df.drop(columns = ['AP005'])
df.head(2)
df["AP002"] = df["AP002"].astype('object')
df["AP003"] = df["AP003"].astype('object')
df["AP004"] = df["AP004"].astype('object')
df["AP006"] = df["AP006"].astype('object')
df["AP007"] = df["AP007"].astype('object')
df["AP008"] = df["AP008"].astype('object')
df["AP009"] = df["AP009"].astype('object')
df["CR015"] = df["CR015"].astype('object')
df["MB007"] = df["MB007"].astype('object')
df['Loan_app_day_name'] = df['Loan_app_day_name'].astype('object')
df['Loan_app_month'] = df['Loan_app_month'].astype('object')
df['Loan_app_time'] = df['Loan_app_time'].astype('object')
df['Loan_app_day'] = df['Loan_app_day'].astype('object')
df = df.drop(columns = ['Unnamed: 0', 'Unnamed: 0.1', 'id', 'TD025', 'TD026', 'TD027', 'TD028', 'CR012','TD029', 'TD044', 'TD048', 'TD051', 'TD054', 'TD055', 'TD061', 'TD062'])
df.head(2)
As per the variable descriptions, all the following columns are counts, lengths, or days. Hence, negative values such as -999, -99, -98, -1, etc. are mis-coded NAs and need to be converted back to NaN.
features_nan = ['AP001',
'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005']
# Define a function to convert negatives to NaN
def convert_to_nan(var):
    df.loc[df[var] < 0, var] = np.nan

for i in features_nan:
    convert_to_nan(i)
# Verify that the negatives are gone
print("The minimum now stands at 0 for most of the columns, verifying the mis-represented values are gone.")
df[features_nan].describe()
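As an aside, the same conversion can be done without a loop: pandas' mask() replaces values where a condition holds with NaN in one step. A sketch on a hypothetical toy frame:

```python
# Vectorized alternative to the convert_to_nan loop: negatives become NaN.
import pandas as pd

toy = pd.DataFrame({'TD001': [3, -999, 5], 'CR004': [-98, 2, -1]})
clean = toy.mask(toy < 0)
print(clean['TD001'].tolist())  # [3.0, nan, 5.0]
```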
Multivariate imputer that estimates each feature from all the others. A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion.
The documentation is here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer
df_2 = df[features_nan]
# Verify
df_2.head(3)
imp = IterativeImputer(missing_values=np.nan, sample_posterior=False,
max_iter=10, tol=0.001,
n_nearest_features=None, initial_strategy='median')
imp.fit(df_2)
imputed_data_median = pd.DataFrame(data=imp.transform(df_2),
columns=['AP001',
'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005'],
dtype='int')
imputed_data_median.head(3)
Convert CR009 to a category variable and bin appropriately
df['CR009'] = pd.cut(x=df['CR009'], bins=[-1, 100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000, 900000, 1000000, 1500000])
df = df.astype({'CR009':'object'})
df.CR009.value_counts()
corr = df[['loan_default', 'AP001', 'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010', 'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024']].corr()
f,ax = plt.subplots(figsize=(18,12))
sns.heatmap(corr, annot=True, cmap='Greens', linewidths=.4, fmt= '.1f',ax=ax)
plt.show()
# Remove one feature from each pair with correlation above 0.7
# However, H2O handles this problem well, so I will not remove the variables
corr_var_drop1 = ['TD005', 'TD022', 'TD006', 'TD009', 'TD013', 'TD023', 'TD010', 'TD014']
I will keep the other variables, as they are all call-detail (CD) data.
filter_col = [col for col in df if col.startswith('CD')]
filter_col.append('loan_default')
corr = df[filter_col].corr()
f,ax = plt.subplots(figsize=(21,21))
sns.heatmap(corr, annot=True, cmap='Greens', linewidths=.4, fmt= '.1f',ax=ax)
plt.show()
# Remove one feature from each pair with correlation above 0.7
# However, H2O handles this problem well, so I will not remove the variables
corr_var_drop2 = ['CD173', 'CD172', 'CD170', 'CD169', 'CD167', 'CD166', 'CD164', 'CD162',
'CD137', 'CD136', 'CD135', 'CD133', 'CD132', 'CD131', 'CD117', 'CD118',
'CD120', 'CD121', 'CD123', 'CD114', 'CD113', 'CD108', 'CD107', 'CD106',
'CD101', 'CD072']
df_bin = df.copy(deep = True)
df_bin.head(2)
# Write a function and loop through
def binning(var):
    df_bin[var + '_bin'] = pd.qcut(df_bin[var], 15, duplicates='drop').values.add_categories("NoData")
    df_bin[var + '_bin'] = df_bin[var + '_bin'].fillna("NoData").astype(str)
    df_bin[var + '_bin'].value_counts(dropna=False)
features = ['AP001', # 'AP002', 'AP003', 'AP004', 'AP006', 'AP007',
# 'AP008', 'AP009',
'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
#'CR009', 'CR015',
'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005'
# 'MB007', 'Loan_app_day_name', 'Loan_app_month', 'Loan_app_time',
# 'Loan_app_day'
]
for i in features:
    binning(i)
# View the bins of some variables
print(df_bin['TD001_bin'].value_counts(dropna=False))
print(df_bin['TD022_bin'].value_counts(dropna=False))
% Y by X for all the numerical columns
The 'mean' column represents the '% Y by X'.
def plot_X_and_Y(var):
    z = df_bin.groupby(var + '_bin')['loan_default'].agg(['count','mean']).reset_index()
    z['count_pcnt'] = z['count']/z['count'].sum()
    x = z[var + '_bin']
    y_mean = z['mean']
    count_pcnt = z['count_pcnt']
    ind = np.arange(0, len(x))
    width = .5
    fig = plt.figure(figsize=(16,4))
    plt.subplot(121)
    plt.bar(ind, count_pcnt, width, color='r')
    #plt.ylabel('X')
    plt.title(var + ' Distribution')
    plt.xticks(ind, x.tolist(), rotation=45)
    plt.subplot(122)
    plt.bar(ind, y_mean, width, color='b')
    #plt.ylabel('Y by X')
    plt.xticks(ind, x.tolist(), rotation=45)
    plt.tight_layout()
    plt.title('Response mean by ' + var)
    plt.show()
#for i in features:
#    plot_X_and_Y(i)
% Y by X for all the categorical columns
The 'mean' column represents the '% Y by X'.
features_2 = ['AP002', 'AP003', 'AP004', 'AP006', 'AP007', 'AP008', 'AP009',
'CR009','CR015', 'MB007', 'Loan_app_day_name', 'Loan_app_month',
'Loan_app_day'
]
def plot_X_and_Y_cat(var):
    z = df_bin.groupby(var)['loan_default'].agg(['count','mean']).reset_index()
    z['count_pcnt'] = z['count']/z['count'].sum()
    x = z[var]
    y_mean = z['mean']
    count_pcnt = z['count_pcnt']
    ind = np.arange(0, len(x))
    width = .5
    fig = plt.figure(figsize=(16,4))
    plt.subplot(121)
    plt.bar(ind, count_pcnt, width, color='r')
    plt.ylabel('X')
    plt.title(var + ' Distribution')
    plt.xticks(ind, x.tolist(), rotation=45)
    plt.subplot(122)
    plt.bar(ind, y_mean, width, color='b')
    plt.ylabel('Y by X')
    plt.xticks(ind, x.tolist(), rotation=45)
    plt.tight_layout()
    plt.title('Response mean by ' + var)
    plt.show()

for i in features_2:
    plot_X_and_Y_cat(i)
From the above graphs, the following variables do not seem important, as they show no pattern, trend, or curve on the '% Y by X' graph:
Loan_app_day_name
df_count = df['AP006'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP006 - OS Type','Count']
print(df_count.head())
fig = px.bar(df_count, x = 'AP006 - OS Type', y = 'Count', color = 'AP006 - OS Type',
width=600, height=400,
title = "Distribution of OS type")
fig.show()
df_count = df['AP002'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP002 - Gender','Count']
print(df_count.head())
fig = px.bar(df_count, x = 'AP002 - Gender', y = 'Count', color = 'AP002 - Gender',
width=600, height=400,
title = "Distribution of Gender")
fig.show()
df_count = df['AP003'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP003 - Education','Count']
print(df_count.head())
fig = px.bar(df_count, x = 'AP003 - Education', y = 'Count', color = 'AP003 - Education',
width=600, height=400,
title = "Distribution of Education")
fig.show()
fig = px.box(df, x="TD001",width=1000, height=500,
title = "Distribution of TD001 - TD_CNT_QUERY_LAST_7Day_P2P")
fig.show()
fig = px.box(df, x="MB005",width=1000, height=500,
title = "Distribution of MB005")
fig.show()
fig = px.box(df, x="AP007", y="TD001",width=900, height=400,
color = "AP002",
title = "The Distribution of Level Application City by TD_CNT_QUERY_LAST_7Day_P2P")
fig.show()
fig = sns.pairplot(df[['AP002', 'AP003', 'AP004']],
hue= 'AP004')
fig
# Overwrite the NA-value columns with the previously imputed values
df[features_nan] = imputed_data_median
df.head(2)
df.isnull().sum().sum()
import h2o
h2o.init()
from h2o.estimators.random_forest import H2ORandomForestEstimator
train,test = train_test_split(df,test_size=0.25, random_state = 1234)
# Convert to a h2o dataframe for computation
df_hex = h2o.H2OFrame(df)
train_hex = h2o.H2OFrame(train)
test_hex = h2o.H2OFrame(test)
# This test_hex will be used all throughout the models
# Selecting all predictor variables
predictors = ['AP001', 'AP002', 'AP003', 'AP004', 'AP006', 'AP007',
'AP008', 'AP009', 'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
'CR009', 'CR015', 'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005',
'MB007', 'Loan_app_day_name', 'Loan_app_month', 'Loan_app_time',
'Loan_app_day']
target = 'loan_default'
len(predictors)
len(df.columns.to_list())
RF_h2o_modl1 = H2ORandomForestEstimator(
model_id = 'RF_h2o_modl1',
ntrees = 400,
nfolds=10,
min_rows=100,
seed=1234)
RF_h2o_modl1.train(predictors,target,training_frame = train_hex)
var_imps = pd.DataFrame(RF_h2o_modl1.varimp(), columns = ['Variable', 'Relative_Importance',
'Scaled_Importance', 'Percentage'])
The model has 75 features in total. After running the initial model with all features, I tried many different combinations to get the best Lift score and found that Lift is best with the top 55 features in the model. So I will select the top 55 features and re-run the model.
predictors = var_imps['Variable'].head(55).to_list()
target = 'loan_default'
RF_h2o_modl1 = H2ORandomForestEstimator(
model_id = 'RF_h2o_modl1',
ntrees = 400,
nfolds=10,
min_rows=100,
seed=1234)
RF_h2o_modl1.train(predictors,target,training_frame = train_hex)
var_imps = pd.DataFrame(RF_h2o_modl1.varimp(), columns = ['Variable', 'Relative_Importance',
'Scaled_Importance', 'Percentage'])
print()
print("The 10 most important features are: ")
print()
var_imps.head(10)
def VarImp(model_name, m_name):
    # plot the variable importance
    plt.rcdefaults()
    variables = model_name._model_json['output']['variable_importances']['variable']
    y_pos = np.arange(len(variables))
    fig, ax = plt.subplots(figsize=(8,16))
    scaled_importance = model_name._model_json['output']['variable_importances']['scaled_importance']
    ax.barh(y_pos, scaled_importance, align='center', color='green')
    ax.set_yticks(y_pos)
    ax.set_yticklabels(variables)
    ax.invert_yaxis()
    ax.set_xlabel('Scaled Importance')
    ax.set_title(m_name + ' Variable Importance in Decreasing Order')
    plt.show()
VarImp(RF_h2o_modl1,'Random Forest')
Use the original test_hex set for prediction
def actual_predict(model, test_hex, target):
    y_pred = model.predict(test_hex).as_data_frame()
    y_actual = test_hex[target].as_data_frame()
    df_actual_predict = pd.concat([y_actual, y_pred], axis=1)
    df_actual_predict.columns = ['actual','pred']
    return df_actual_predict
RF_actual_predict = actual_predict(RF_h2o_modl1,test_hex,target)
RF_actual_predict.head()
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was developed for operators of military radar receivers, which is why it is so named.
dd = RF_actual_predict
RF_roc_auc_value = roc_auc_score(dd['actual'],dd['pred'])
RF_roc_auc_value
def gains_table(df_actual_predict):
    df_actual_predict = df_actual_predict.sort_values(by='pred', ascending=False)
    df_actual_predict['row_id'] = range(len(df_actual_predict))
    df_actual_predict['decile'] = (df_actual_predict['row_id'] / (len(df_actual_predict)/10)).astype(int)
    df_actual_predict.loc[df_actual_predict['decile'] == 10, 'decile'] = 9
    # Create gains table
    gains = df_actual_predict.groupby('decile')['actual'].agg(['count','sum'])
    gains.columns = ['count','actual']
    gains['non_actual'] = gains['count'] - gains['actual']
    gains['cum_count'] = gains['count'].cumsum()
    gains['cum_actual'] = gains['actual'].cumsum()
    gains['cum_non_actual'] = gains['non_actual'].cumsum()
    gains['percent_cum_actual'] = (gains['cum_actual'] / np.max(gains['cum_actual'])).round(2)
    gains['percent_cum_non_actual'] = (gains['cum_non_actual'] / np.max(gains['cum_non_actual'])).round(2)
    gains['if_random'] = np.max(gains['cum_actual']) / 10
    gains['if_random'] = gains['if_random'].cumsum()
    gains['lift'] = (gains['cum_actual'] / gains['if_random']).round(2)
    gains['K_S'] = np.abs(gains['percent_cum_actual'] - gains['percent_cum_non_actual']) * 100
    gains['gain'] = (gains['cum_actual'] / gains['cum_count'] * 100).round(2)
    return gains
RF_gains = gains_table(RF_actual_predict)
RF_gains
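To make the lift column concrete, here is the arithmetic on a hypothetical toy portfolio (numbers assumed purely for illustration): 100 accounts, 20 defaulters, and a model that places 8 of the defaulters in the top decile.

```python
# Lift = defaulters captured in a decile vs. defaulters expected at random.
cum_actual = 8          # defaulters the model captures in the top decile
if_random = 20 / 10     # defaulters any decile would capture at random
lift = cum_actual / if_random
print(lift)  # 4.0
```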
def ROC_PR(df_actual_predict):
    print('')
    print(' * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate')
    print('')
    print(' * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy')
    print('')
    print(' * Recall (R) = the number of true positives / (the number of true positives + the number of false negatives)')
    print('')
    # ROC
    roc_auc_value = roc_auc_score(df_actual_predict['actual'], df_actual_predict['pred'])
    fpr, tpr, _ = roc_curve(df_actual_predict['actual'], df_actual_predict['pred'])
    roc_auc = auc(fpr, tpr)
    lw = 2
    plt.figure(figsize=(10,4))
    plt.subplot(1,2,1)
    plt.plot(fpr, tpr, color='darkorange', lw=lw, label='AUC = %0.4f' % roc_auc_value)
    plt.plot([0,1], [0,1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0,1])
    plt.ylim([0,1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve: AUC={0:0.4f}'.format(roc_auc_value))
    plt.legend(loc='lower right')
    # Precision-Recall
    plt.subplot(1,2,2)
    average_precision = average_precision_score(df_actual_predict['actual'], df_actual_predict['pred'])
    precision, recall, _ = precision_recall_curve(df_actual_predict['actual'], df_actual_predict['pred'])
    plt.step(recall, precision, color='b', alpha=0.2, where='post')
    plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.title('Precision-Recall curve: PR={0:0.4f}'.format(average_precision))
    plt.show()
ROC_PR(RF_actual_predict)
The balance_classes option can be used to balance the class distribution. When enabled, H2O will either undersample the majority classes or oversample the minority classes. If this option is enabled, we can also specify values for the class_sampling_factors and max_after_balance_size options to control the sampling:
class_sampling_factors takes a list of numbers, one sampling rate per class. A value of 1 leaves a class's sample rate unchanged, 0.5 halves it, and 2 doubles it.
max_after_balance_size is the maximum relative size the training data can grow to. By default it is 5: the data can be oversampled until it is at most 5x larger than the original. If you have many rows and prefer to undersample the majority class, set max_after_balance_size to a value less than 1.
See the H2O documentation.
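A hypothetical illustration of the class_sampling_factors arithmetic (the class counts below are assumptions for illustration, not values computed in this notebook): a factor of 0.5 halves a class and 2.0 doubles it.

```python
# Per-class resampling arithmetic implied by class_sampling_factors.
counts = {0: 64512, 1: 16128}   # assumed majority/minority class counts
factors = {0: 0.5, 1: 2.0}      # halve the majority, double the minority
sampled = {c: int(n * factors[c]) for c, n in counts.items()}
print(sampled)  # {0: 32256, 1: 32256}
```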
RF_h2o_modl2 = H2ORandomForestEstimator(
model_id = 'RF_h2o_modl2',
ntrees = 400,
nfolds=10,
min_rows=100,
balance_classes = True,
seed=1234)
RF_h2o_modl2.train(predictors,target,training_frame = train_hex)
RF_actual_predict = actual_predict(RF_h2o_modl2,test_hex,target)
RF_actual_predict.head()
dd = RF_actual_predict
RF_roc_auc_value = roc_auc_score(dd['actual'],dd['pred'])
RF_roc_auc_value
RF_gains = gains_table(RF_actual_predict)
RF_gains
ROC_PR(RF_actual_predict)
I see that in both random forest models, the first without balanced classes and the second with balance_classes enabled, the output is the same.
This is probably because in our dataset the minority class represents 19.3%, around 1/5th of the total data, and the majority class represents 80.7%, around 4/5th, roughly 4 times the minority class. Hence, the likely reason both models give the same results is that this class distribution is not badly skewed, and 20% of the data in one class is fine for building a stable model. If the majority class had more than 5-6 times the minority class data, the balance_classes parameter would have made a real difference.
Also, for the train/test split, I found that a 75:25 split works best for all my models. I tried other combinations, but a 75% train split gives me the best Lift score of 2.11 in both models. So I will continue with the same split ratio.
The optimum hyper-parameters for both the models built above are:
Please see below for the meaningful business insights:
The H2O package is a very effective and efficient package for building a machine learning model to predict loan default. It is also very handy for displaying variable importance, handling correlations, and dummy-coding the categorical variables.
Gains table and Lift:
For the final model, built after tuning the models on various values of each parameter and finally tuning all the hyper-parameters together for the best result, the highest Lift score I obtained is 2.11, which is good by industry standards. A Lift score above 2 is generally considered acceptable.
ROC and AUC: The area under the ROC curve (AUC) assesses overall classification performance. But AUC does not place more emphasis on one class over the other, so it does not reflect the minority class well. Precision-Recall (PR) curves are more informative than ROC when dealing with highly skewed datasets. PR curves plot precision vs. recall. Because precision is directly influenced by class imbalance, precision-recall curves better highlight differences between models on highly imbalanced datasets.
However, neither of them represents the results perfectly: one does not reflect the minority class well, and the other is sensitive to imbalanced data. Hence, we use the H2O package, which helps take care of these problems for us. We get an AUC of 0.68 and a PR score of 0.34, which is acceptable, but I will try to improve them further in my following models.
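A small synthetic sketch of this point (assumed setup: roughly a 95:5 class ratio and a plain logistic model, unrelated to the loan data), showing that ROC AUC and average precision can tell quite different stories on imbalanced data:

```python
# Compare ROC AUC with average precision (area under the PR curve) on an
# imbalanced synthetic classification problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print('AUC:', roc_auc_score(y_te, p), 'AP:', average_precision_score(y_te, p))
```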
The major outcome of this exercise is that random forest is a good model to predict loan default, as per the given data. However, we should not overlook other strong boosting models such as GBM, XGBoost, or AutoML; these might provide better results as well.
# Select a few numeric variables for the PCA
num_variables = ['loan_default', 'AP001', 'TD001',
'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005']
# Create a new temporary df with only the numeric columns for PCA
df_new = df[num_variables]
target = 'loan_default'
y = df_new[target]
X = df_new.drop(target,axis=1)
y.dtypes
y1_cnt = df_new[target].sum()
print(y1_cnt)
# Taking twice the number of 0's as compared to 1's; currently it's 4 times.
N = 2
y0_cnt = y1_cnt * N
y0_cnt
y.value_counts()
# Adjust the temporary dataset as per the 2:1 ratio we defined above
X_rs, y_rs = make_imbalance(X, y,
sampling_strategy={1:y1_cnt , 0: y0_cnt},
random_state=0)
#X_rs = pd.DataFrame(X_rs)
#y_rs = pd.DataFrame(y_rs)
y_rs.value_counts()
smpl = pd.concat([X_rs,y_rs], axis = 1)
# Verify
smpl['loan_default'].value_counts()
For ease of visualization, I use principal component analysis to reduce the dimensions and pick the first two principal components. A scatterplot for the dataset is shown below.
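Before trusting a two-component view, it is worth checking how much variance those components actually retain. A sketch on standardized synthetic data (the data below are illustrative, not from the loan set):

```python
# Fraction of total variance kept by the first two principal components.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=500, n_features=15, random_state=0)
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)
print(pca.explained_variance_ratio_.sum())
```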
def plot_this(X_rs, y_rs, method):
    # Use principal components to condense the features to 2 features
    pca = PCA(n_components=2).fit(X_rs)
    pca_2d = pca.transform(X_rs)
    # Assign colors
    for i in range(0, pca_2d.shape[0]):
        if y_rs[i] == 0:
            c1 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='r', marker='o')
        elif y_rs[i] == 1:
            c2 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='g', marker='*')
    pl.legend([c1, c2], ['Class 1', 'Class 2'])
    pl.title(method)
    pl.axis([-4, 5, -4, 4])  # x axis (-4,5), y axis (-4,4)
    pl.show()
# plot_this(X,y,'Original')
# This takes a long time to run, so it is left commented out
print('Random UnderSampler {}'.format(Counter(y_rs)))
#plot_this(X_rs,y_rs,'Random undersampling')
# This takes a long time to run, so it is left commented out
I will manually perform random undersampling using the RandomUnderSampler function from the imblearn package, and then build the H2O random forest model.
df2 = df.copy()
target = 'loan_default'
y = df2[target]
X = df2.drop(target,axis=1)
y.dtypes
# Check Statistics
y1_cnt = df2[target].sum()
print(y1_cnt)
df['loan_default'].value_counts()
print("The percent of data with value = 1 or loan default = 1 in the target column is")
(df2['loan_default'].value_counts(dropna=False)[1]/df2.shape[0])*100
# I plan to reduce the majority class to only double the minority class; it's currently 4 times
df['loan_default'].value_counts(dropna=False)[1]*2
y.value_counts()
from imblearn.under_sampling import RandomUnderSampler
# Here, I will decrease the size of the bigger sample class to 2 times the minority
sampler = RandomUnderSampler(sampling_strategy={1: 15488, 0: 30976},random_state=0)
X_rs, y_rs = sampler.fit_resample(X, y)
print('RandomUnderSampler {}'.format(Counter(y_rs)))
# Verify
y_rs.value_counts()
X_rs = pd.DataFrame(X_rs)
y_rs = pd.DataFrame(y_rs)
smpl = pd.concat([X_rs,y_rs], axis = 1)
smpl_hex = h2o.H2OFrame(smpl)
An important note: for training, I will use the entire resampled dataset, which is balanced but no longer representative of the original class distribution.
For prediction, however, I will use the original test_hex data defined above; I do not want to predict on resampled test data. Hence, for this manual undersampling model, the train/test split is skipped, as test_hex already exists.
rf_u_sample = H2ORandomForestEstimator(
model_id = 'rf_u_sample',
ntrees = 400,
nfolds=10,
min_rows=100,
seed=1234)
rf_u_sample.train(predictors,target,training_frame=smpl_hex)
RF_actual_predict = actual_predict(rf_u_sample,test_hex,target)
RF_actual_predict.head()
# The 'test_hex' which will be used here will be original test data, not a test split from the new balanced dataset.
dd = RF_actual_predict
RF_roc_auc_value = roc_auc_score(dd['actual'],dd['pred'])
RF_roc_auc_value
RF_gains = gains_table(RF_actual_predict)
RF_gains
ROC_PR(RF_actual_predict)
I will manually perform random oversampling using the RandomOverSampler function from the imblearn package, and then build the H2O random forest model.
df2 = df.copy()
target = 'loan_default'
y = df2[target]
X = df2.drop(target,axis=1)
y.dtypes
# Check Statistics
y1_cnt = df2[target].sum()
print(y1_cnt)
print(df2['loan_default'].value_counts())
print("The percent of data with value = 1 or loan default = 1 in the target column is")
(df2['loan_default'].value_counts(dropna=False)[1]/df2.shape[0])*100
# I plan to grow the minority class to 4 times its size
df['loan_default'].value_counts(dropna=False)[1]*4
from imblearn.over_sampling import RandomOverSampler
# RandomOverSampler
# With over-sampling methods, the number of samples in a class
# should be greater than or equal to the original number of samples.
# Here, I will increase the minority class so both classes have 64,512 rows
sampler = RandomOverSampler(sampling_strategy={1: 64512, 0: 64512},random_state=0)
X_rs, y_rs = sampler.fit_resample(X, y)
print('RandomOverSampler {}'.format(Counter(y_rs)))
# plot_this(X_rs,y_rs)
X_rs = pd.DataFrame(X_rs)
y_rs = pd.DataFrame(y_rs)
smpl = pd.concat([X_rs,y_rs], axis = 1)
train,test = train_test_split(smpl,test_size=0.25, random_state = 1234)
# Convert to a h2o dataframe for computation
smpl_hex = h2o.H2OFrame(smpl)
train_hex = h2o.H2OFrame(train)
As before, I will train on the entire resampled (balanced) dataset. For prediction, I will use the original test_hex data defined above, since I do not want to evaluate on resampled test data.
rf_o_sample = H2ORandomForestEstimator(
model_id = 'rf_o_sample',
ntrees = 400,
nfolds=10,
min_rows=100,
seed=1234)
rf_o_sample.train(predictors,target,training_frame=smpl_hex)
var_imps = pd.DataFrame(rf_o_sample.varimp(), columns = ['Variable', 'Relative_Importance',
                                                         'Scaled_Importance', 'Percentage'])
print()
print("The 10 most important features are: ")
print()
var_imps.head(10)
VarImp(rf_o_sample,'Random Forest')
RF_actual_predict = actual_predict(rf_o_sample,test_hex,target)
RF_actual_predict.head()
# The 'test_hex' which will be used here will be original test data, not a test split from the new balanced dataset.
dd = RF_actual_predict
RF_roc_auc_value = roc_auc_score(dd['actual'],dd['pred'])
RF_roc_auc_value
RF_gains = gains_table(RF_actual_predict)
RF_gains
print()
print("The final best Lift score is: ")
print()
print(RF_gains['lift'][0])
ROC_PR(RF_actual_predict)
Here I ran the random forest models, first with manual undersampling and then with manual oversampling; the output differs largely between the two.
In our dataset, the minority class represents 19.3%, around 1/5th of the total data, and the majority class represents 80.7%, around 4/5th, roughly 4 times the minority class. So manual under- and over-sampling changed model performance significantly.
In fact, with manual undersampling the Lift score went up from 2.11 to 2.38 in my best model, and with manual oversampling the Lift score went further up to 3.01.
Also, for the train/test split, I found that a 75:25 split works best for all my models. I tried other combinations, but a 75% train split gives me the best Lift score in all the models.
The optimum hyper-parameters for the Over sampling model with the best Lift score of 3.01 are:
Please see below for the meaningful business insights:
The H2O package is a very effective and efficient package for building a machine learning model to predict loan default. It is also very handy for displaying variable importance, handling correlations, and dummy-coding the categorical variables.
Gains table and Lift:
For the final model, built after tuning the different models on various values of each parameter and finally tuning all the hyper-parameters together for the best result, the highest Lift score I obtained is 3.01, which is very good by industry standards. A Lift score above 2 is generally considered acceptable.
ROC and AUC: The area under the ROC curve (AUC) assesses overall classification performance. But AUC does not place more emphasis on one class over the other, so it does not reflect the minority class well. Precision-Recall (PR) curves are more informative than ROC when dealing with highly skewed datasets. PR curves plot precision vs. recall. Because precision is directly influenced by class imbalance, precision-recall curves better highlight differences between models on highly imbalanced datasets.
However, neither of them represents the results perfectly: one does not reflect the minority class well, and the other is sensitive to imbalanced data. Hence, we use the H2O package, which helps take care of these problems for us.
In my case, the AUC is 0.79 and the PR score is 0.50, which is acceptable.
The major outcome of this exercise is that random forest through the H2O package is a good approach to predict loan default. However, we should not overlook other strong boosting models such as GBM, XGBoost, or AutoML; these might provide better results as well.